Summary for Modelling

Pointers for different ML models

lruolin
11-23-2021

OVERVIEW

This is a summary post on the things to take note of when dealing with different models. The text summarises what I read from the three books (links are shown in the Reference section).

KEY SUMMARY

  1. Linear models: The go-to first algorithm to try. Good for large datasets.
  2. k-nearest neighbours: For small datasets; good as a baseline.
  3. Decision trees: Fast, do not need scaling of the data, easily visualised and explained.
  4. Random forests: Do not need scaling of the data; not good for high-dimensional sparse data.
  5. SVM: Good for medium-sized datasets with predictors that have a similar meaning. Requires scaling of the data, and parameters must be tuned.
  6. Neural networks: Sensitive to the scaling of the data and to the choice of parameters. Can build very complex models, but need a long time to train.

EDA

PREPROCESSING

Check for errors/artifacts

Missing values

The steps in the recipes package to handle missing data are sketched below.
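A minimal sketch of an imputation recipe (mtcars is just a stand-in data set, and the step choices are mine rather than from the books):

```r
library(tidymodels)

# Illustrative imputation recipe; mtcars has no missing values,
# it simply stands in for a data set that does.
rec_impute <- recipe(mpg ~ ., data = mtcars) %>%
  step_impute_median(all_numeric_predictors())
# Alternatives: step_impute_mean(), step_impute_mode() for nominal data,
# step_impute_knn(), step_impute_bag()

rec_impute %>% prep() %>% bake(new_data = NULL) %>% head()
```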

Centering and scaling

The steps in the recipes package to centre and scale data are sketched below.

Normalization (Z-scores) should only be used on normally distributed variables.

For scikit-learn, the available scalers are:
  - StandardScaler (mean = 0, variance = 1)
  - RobustScaler (uses the median and quantiles, so outliers have little influence)
  - MinMaxScaler (all features rescaled to lie between 0 and 1)
  - Normalizer (each feature vector rescaled to a Euclidean length of 1)
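Rough recipes equivalents of the scalers above, as a sketch (mtcars is just a stand-in data set):

```r
library(tidymodels)

rec_scale <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_numeric_predictors()) %>%  # subtract the mean
  step_scale(all_numeric_predictors())       # divide by the standard deviation
# step_normalize() does both at once (like StandardScaler);
# step_range(min = 0, max = 1) is the MinMaxScaler analogue.

rec_scale %>% prep() %>% bake(new_data = NULL) %>% head()
```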

Resolve skewness

The steps in the recipes package to resolve skewness are sketched below.
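A sketch of the usual transformation steps (my own choice of steps, with mtcars as stand-in data):

```r
library(tidymodels)

rec_skew <- recipe(mpg ~ ., data = mtcars) %>%
  # Box-Cox needs strictly positive values; Yeo-Johnson also handles
  # zeros and negatives. step_log() is the simple alternative.
  step_YeoJohnson(all_numeric_predictors())

rec_skew %>% prep() %>% bake(new_data = NULL) %>% head()
```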

Outliers

Data reduction/Feature extraction

The steps in the recipes package for data reduction/feature extraction are sketched below.
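A sketch of a PCA feature-extraction step (mtcars as stand-in data; the number of components is arbitrary here):

```r
library(tidymodels)

rec_pca <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors()) %>%      # PCA needs centred, scaled inputs
  step_pca(all_numeric_predictors(), num_comp = 3)  # keep the first 3 components

rec_pca %>% prep() %>% bake(new_data = NULL) %>% head()
```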

Removing Predictors

The steps in the recipes package to remove uninformative predictors are sketched below.
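A sketch of the filtering steps (mtcars as stand-in data):

```r
library(tidymodels)

rec_filter <- recipe(mpg ~ ., data = mtcars) %>%
  step_zv(all_predictors()) %>%   # drop zero-variance predictors
  step_nzv(all_predictors())      # drop near-zero-variance predictors
```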

Multi-collinearity

The steps in the recipes package to deal with multi-collinearity are sketched below.
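A sketch of the correlation filter (mtcars as stand-in data; the threshold is illustrative):

```r
library(tidymodels)

rec_corr <- recipe(mpg ~ ., data = mtcars) %>%
  # drop one predictor from each pair with absolute correlation above 0.9
  step_corr(all_numeric_predictors(), threshold = 0.9)
```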

DATA SPLITTING

Training, Testing
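A sketch of an 80/20 split with rsample (mtcars as stand-in data):

```r
library(tidymodels)
set.seed(123)

car_split <- initial_split(mtcars, prop = 0.8)  # add strata = <outcome> for a stratified split
car_train <- training(car_split)
car_test  <- testing(car_split)
```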

Resampling

k-fold cross-validation

bootstrapping
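A sketch covering both resampling schemes above (mtcars as stand-in data):

```r
library(tidymodels)
set.seed(123)

folds <- vfold_cv(mtcars, v = 10)        # 10-fold cross-validation
boots <- bootstraps(mtcars, times = 25)  # 25 bootstrap resamples
```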

tuning for model performance
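A sketch of grid tuning over resamples, using a k-nearest-neighbours regression purely as an example model:

```r
library(tidymodels)
set.seed(123)

folds <- vfold_cv(mtcars, v = 5)

knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

knn_wf <- workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(knn_spec)

knn_res <- tune_grid(knn_wf, resamples = folds, grid = 10)
show_best(knn_res, metric = "rmse")   # best candidate values of neighbours
```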

SUPERVISED LEARNING

Regression

Linear regression

OLS

Preprocessing
Tuning Parameters

No tuning parameters
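A sketch of an OLS fit with parsnip (the formula and data are illustrative):

```r
library(tidymodels)

lm_spec <- linear_reg() %>% set_engine("lm")
lm_fit  <- lm_spec %>% fit(mpg ~ wt + hp, data = mtcars)
tidy(lm_fit)   # coefficient table
```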

Shrinkage methods for linear regression

Ridge Regression
Preprocessing:
Tuning parameter:
Lasso Regression
Preprocessing:
Tuning parameter:
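A sketch of the two shrinkage specifications via glmnet; penalty (lambda) is the parameter to tune, and mixture switches between ridge and lasso:

```r
library(tidymodels)

ridge_spec <- linear_reg(penalty = tune(), mixture = 0) %>%  # mixture = 0 -> ridge
  set_engine("glmnet")

lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%  # mixture = 1 -> lasso
  set_engine("glmnet")
# Predictors should be centred and scaled (e.g. step_normalize()) before fitting.
```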

Non-Linear regression

Neural Networks

Preprocessing:

SVM (Support Vector Machines)

k-nearest neighbours
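Sketches of the three non-linear regression specifications; all of them are sensitive to predictor scale, so centre and scale first:

```r
library(tidymodels)

mlp_spec <- mlp(hidden_units = 5, penalty = 0.01, epochs = 100) %>%  # single-hidden-layer net
  set_engine("nnet") %>%
  set_mode("regression")

svm_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%           # radial-kernel SVM
  set_engine("kernlab") %>%
  set_mode("regression")

knn_spec <- nearest_neighbor(neighbors = tune()) %>%                 # k-nearest neighbours
  set_engine("kknn") %>%
  set_mode("regression")
```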


Decision Trees

Decision trees can be applied to both regression and classification problems.

Preprocessing:
Tuning parameters:
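A sketch of a tunable regression tree specification (rpart engine):

```r
library(tidymodels)

tree_spec <- decision_tree(
  cost_complexity = tune(),   # pruning penalty (cp in rpart)
  tree_depth      = tune(),
  min_n           = tune()    # minimum observations in a node to allow a split
) %>%
  set_engine("rpart") %>%
  set_mode("regression")
```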

Random Forests

Preprocessing
Tuning
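A sketch of a random forest specification (ranger engine); no scaling is needed, and mtry and min_n are the usual tuning parameters:

```r
library(tidymodels)

rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression")
```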

Classification

k-nearest neighbour

Preprocessing:
Tuning:
  1. Number of neighbours
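A sketch of the classifier specification; the number of neighbours is the tuning parameter, and the predictors should be centred and scaled first since the model is distance based:

```r
library(tidymodels)

knn_cls <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")
```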

Logistic Regression

Logistic regression models the probability that Y belongs to a particular category.

Generic 0/1 encoding is used for the outcome (e.g. 0 = no, 1 = yes for defaulting on credit).

log(odds of defaulting) = b0 + b1*X

If X = balance and b1 = 0.0055, then a one-unit increase in balance is associated with an increase in the log-odds of defaulting of 0.0055 units.

If the p-value for b1 is significant, then there is an association between balance and the probability of default.

Preprocessing:
Tuning parameters:
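A sketch of the defaulting example above; using the Default data from the ISLR package is my assumption, while the coefficient value is the one quoted above:

```r
# Base-R fit of the default ~ balance model (assumes the ISLR package is installed)
library(ISLR)
default_fit <- glm(default ~ balance, data = Default, family = binomial)
coef(default_fit)   # slope for balance is roughly 0.0055, on the log-odds scale

# tidymodels equivalent specification
library(tidymodels)
log_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
```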

Support Vector Classifier

Preprocessing:
Tuning parameters:

SVM (Support Vector Machines)

Preprocessing:
Tuning parameters:
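Sketches of both specifications via kernlab; cost (and rbf_sigma for the kernel version) must be tuned, and the predictors should be centred and scaled:

```r
library(tidymodels)

svc_spec <- svm_linear(cost = tune()) %>%                   # linear support vector classifier
  set_engine("kernlab") %>%
  set_mode("classification")

svm_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%  # radial-kernel SVM
  set_engine("kernlab") %>%
  set_mode("classification")
```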

Decision Trees

Decision trees can be applied to both regression and classification problems. A decision tree has three basic components: internal nodes, branches, and leaf (terminal) nodes. Each internal node represents a feature (predictor) to split on, each branch represents the decision or split rule, and each leaf provides the result of the prediction.

Preprocessing:
Tuning parameters:
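A small fitted example on iris (my own choice of data) to show the classification mode:

```r
library(tidymodels)

tree_cls <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_fit <- tree_cls %>% fit(Species ~ ., data = iris)
predict(tree_fit, new_data = iris) %>% head()   # predicted class per row
```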

Random Forests

Preprocessing
Tuning

Neural Networks

UNSUPERVISED LEARNING

Clustering

PCA

Preprocessing:
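A base-R sketch; the variables must be scaled before PCA:

```r
pca_fit <- prcomp(mtcars, scale. = TRUE)  # centre and scale, then rotate
summary(pca_fit)                          # variance explained per component
head(pca_fit$x[, 1:2])                    # scores on the first two components
```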

k-means clustering

Preprocessing:
Tuning parameters:
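A base-R sketch; k-means is distance based, so scale first, and the number of centres k is the main tuning choice (often picked with an elbow or silhouette plot):

```r
set.seed(123)
scaled_cars <- scale(mtcars)                           # centre and scale
km_fit <- kmeans(scaled_cars, centers = 3, nstart = 25)
km_fit$cluster                                         # cluster assignment per car
```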

Hierarchical Clustering

Preprocessing:
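A base-R sketch using complete linkage on scaled data:

```r
d_cars <- dist(scale(mtcars))                  # Euclidean distances on scaled data
hc_fit <- hclust(d_cars, method = "complete")  # "average" or "ward.D2" are alternatives
plot(hc_fit)                                   # dendrogram
cutree(hc_fit, k = 3)                          # cut the tree into 3 clusters
```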

METRICS FOR PERFORMANCE

Regression

Classification
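A sketch with yardstick, using quick fits on mtcars and iris only to illustrate the metric functions (predicting on the training data like this is not how performance should actually be estimated):

```r
library(tidymodels)

# Regression metrics: RMSE, R squared, MAE
lm_fit    <- linear_reg() %>% set_engine("lm") %>% fit(mpg ~ wt + hp, data = mtcars)
reg_preds <- augment(lm_fit, new_data = mtcars)
metrics(reg_preds, truth = mpg, estimate = .pred)

# Classification metrics: accuracy and confusion matrix
tree_fit  <- decision_tree() %>% set_engine("rpart") %>%
  set_mode("classification") %>% fit(Species ~ ., data = iris)
cls_preds <- augment(tree_fit, new_data = iris)
accuracy(cls_preds, truth = Species, estimate = .pred_class)
conf_mat(cls_preds, truth = Species, estimate = .pred_class)
```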

Reference:

Citation

For attribution, please cite this work as

lruolin (2021, Nov. 23). pRactice corner: Summary for Modelling. Retrieved from https://lruolin.github.io/myBlog/posts/20211123 - Summary for modelling/

BibTeX citation

@misc{lruolin2021summary,
  author = {lruolin},
  title = {pRactice corner: Summary for Modelling},
  url = {https://lruolin.github.io/myBlog/posts/20211123 - Summary for modelling/},
  year = {2021}
}